Abstract

In this paper, we present a gradient boosting algorithm for tree-shaped conditional
random fields (CRFs). Conditional random fields are an important class of models
for accurate structured prediction, but effective design of the feature functions
is a major challenge when applying CRF models to real-world data. Gradient
boosting, which can induce and select functions, is a natural candidate solution
for the problem. However, it is non-trivial to derive gradient boosting algorithms
for CRFs, due to the dense Hessian matrices introduced by variable dependencies.
We address this challenge by deriving a Markov chain mixing rate bound to
quantify the dependencies, and introduce a gradient boosting algorithm that
iteratively optimizes an adaptive upper bound of the objective function. The resulting
algorithm induces and selects features for CRFs via functional space optimization,
with provable convergence guarantees. Experimental results on three real-world
datasets demonstrate that the mixing-rate-based upper bound is effective for
training CRFs with non-linear potentials.


1  Introduction 

Many problems in machine learning involve structured prediction, i.e. predicting a group of outputs
that depend on each other. Conditional random fields [9] are among the most successful solutions
to these problems. Variants of tree-shaped conditional random fields have been proposed and widely
applied to structured prediction problems in domains such as natural language processing [9, 16],
computer vision [7, 15] and bioinformatics [21]. As opposed to classification models that assume
independent output variables, CRF models capture the dependency pattern between output and input
via potential functions. Potential functions are usually defined using a linear combination of
carefully engineered features of the input and the output variables. These feature functions are crucial
for learning accurate models. Thus, it is important to ask whether we can induce arbitrary potential
functions automatically (via functional space optimization), instead of manually crafting them
and/or restricting them to linear combinations.

Gradient boosting [4], which performs additive training in functional spaces, is a natural candidate
for this problem. Effective gradient boosting algorithms, such as LogitBoost and its variants [5, 11,
17], have been proposed for inducing feature functions for (independent) multi-class classification
problems. The key ingredient in these methods is the effective use of second order information
via diagonal approximation of Hessian matrices. Unfortunately, it is non-trivial to develop such
boosting methods for CRFs, since variable interdependencies introduce dense Hessian matrices that
make gradient boosting computationally infeasible. Instead, existing boosting
approaches either optimize approximate objectives [20, 12] or only take first order information into
account when optimizing the exact likelihood [1]. The convergence of the latter approach is
guaranteed only with small step sizes.

In this paper, we present a novel gradient boosting algorithm for inducing non-linear feature
functions for tree-shaped CRFs. The CRF training is performed by iteratively optimizing adaptive upper
bounds of the loss function, to address the challenge of dense Hessians. The adaptive bounds, which
are derived using Markov chain mixing rates, measure the dependency between variables, and
accordingly control the conservativeness of the updates. The resulting gradient boosting algorithm,
which can be viewed as a generalization of LogitBoost to structured prediction problems, optimizes
the CRF objective with provable convergence guarantees. Experimental results on three real-world
datasets demonstrate the effectiveness and efficiency of our bound and the proposed algorithm.


2  Overview  of  Method 


Model Formalization Given input $x$, a CRF model describes a distribution over the outputs as

$$P(y \mid x) = \frac{\exp(\Phi(y, x))}{\sum_{y' \in \mathcal{Y}} \exp(\Phi(y', x))} \qquad (1)$$

where $\mathcal{Y}$ is the set of possible output combinations, and $\Phi(y, x)$ captures the dependency between
the input and output variables. The model $\Phi$ usually factorizes as a sum of unary and pairwise (edge)
potential functions $\phi_i$ between individual output variables, which can be expressed as follows:

$$\Phi(y, x) = \sum_{i=1}^{m} \phi_i(x)\, \mu_i(y), \quad \phi_i \in \mathcal{F},\ \mu_i \in \mathcal{N} \cup \mathcal{E}, \quad \text{subject to } \phi_i = \phi_j \text{ for } (i, j) \in \mathcal{C}, \qquad (2)$$

where $\mathcal{N} = \{1(y_t = k)\}$ and $\mathcal{E} = \{1(y_s = k_1, y_t = k_2)\}$ are the sets of indicator functions of each
node and edge state. Each $\mu_i$ corresponds to an event $y_t = k$ or $y_s = k_1, y_t = k_2$ (depending on
whether $\phi_i$ is a node or edge potential). We use $\mu_i$ as shorthand for $\mu_i(y)$ and view it as a vector of
random variables. The family of all possible node and edge potential functions is $\mathcal{F} = \mathcal{F}_N \cup \mathcal{F}_E$,
whose size could be infinite. $\mathcal{C}$ represents equivalence classes in different parts of the model that we
use to capture parameter sharing, which is common in most applications of CRFs. In standard
linear-chain CRFs with linear potentials, $\mathcal{F}$ contains linear functions of $x$, and $\mathcal{C}$ can be used to constrain
the node and edge potential functions in different positions to be the same. LogitBoost, on the other
hand, considers arbitrary $\mathcal{F}$ but constrains the model to contain only node potentials
(there are no edge potentials). In this paper we are interested in arbitrary function families $\mathcal{F}$, and
focus on tree-shaped $\mathcal{E}$ that allow exact inference of marginals.
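
As an illustration of Eqs. (1) and (2), the following minimal sketch enumerates every output of a tiny linear-chain CRF and normalizes the exponentiated scores. It assumes the potentials have already been evaluated at a fixed input x and stored in score tables; the function name `chain_crf_distribution` and the example numbers are hypothetical, and brute-force enumeration is used only for clarity (it is exponential in the chain length, unlike the exact inference used on tree-shaped models).

```python
import itertools
import math


def chain_crf_distribution(node_scores, edge_scores):
    """Return {y: P(y | x)} for a small chain CRF by brute-force enumeration.

    node_scores[t][k]       : node potential value for the event y_t = k
    edge_scores[t][k1][k2]  : edge potential value for y_t = k1, y_{t+1} = k2
    Both tables are assumed to be evaluated at one fixed input x.
    """
    n = len(node_scores)
    num_states = len(node_scores[0])
    scores = {}
    for y in itertools.product(range(num_states), repeat=n):
        # Phi(y, x): sum of the potentials whose indicator mu_i(y) equals 1.
        total = sum(node_scores[t][y[t]] for t in range(n))
        total += sum(edge_scores[t][y[t]][y[t + 1]] for t in range(n - 1))
        scores[y] = total
    # Eq. (1): normalize exp(Phi) over all output combinations.
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {y: math.exp(s - log_z) for y, s in scores.items()}


# Hypothetical scores for a chain with 3 nodes and 2 states per node.
node_scores = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
edge_scores = [[[1.0, -1.0], [-1.0, 1.0]], [[0.2, 0.0], [0.0, 0.2]]]
dist = chain_crf_distribution(node_scores, edge_scores)
print(len(dist), sum(dist.values()))
```

The dictionary has one entry per output combination, and its values sum to one up to floating-point rounding.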


Training Objective Using the functions $\phi$ in Eq. (2) as the model parameters allows us to induce
potential functions automatically through functional space optimization. In particular, we generalize the
standard CRF objective over training data $D = \{(y, x)\}$ to the following:

$$L(\phi) = \sum_{(y,x) \in D} l(y, x, \phi) + \sum_{c} \Omega(\phi_c) = -\sum_{(y,x) \in D} \ln P(y \mid x) + \sum_{c} \Omega(\phi_c). \qquad (3)$$

Here $l$ is the negative log-likelihood function over each data point. $\sum_c \Omega(\phi_c)$ is a regularization
term that measures the complexity of the learned function, defined as a sum over the equivalence
classes defined by $\mathcal{C}$. In standard CRFs, for example, $\Omega$ is often the square of the L2 norm of the parameter
vector. This generalized objective function encourages us to select predictive (i.e. optimizing $l$) and
simple (i.e. optimizing $\Omega$) functions as potentials of a CRF.


Challenges for Function Learning Since the model parameters in this formulation are functions,
Eq. (3) cannot be directly optimized using traditional optimization techniques. Instead, we train
the model additively: at each iteration $t$, our proposed algorithm first searches over the functional
space $\mathcal{F}$ to find functions $\delta = [\delta_1, \delta_2, \cdots, \delta_m]$ that optimize the objective function $L(\phi^{(t)} + \delta)$,
and then adds them to the current model, $\phi^{(t+1)} \leftarrow \phi^{(t)} + \delta$. However, due to the complex nature
of the objective function, directly performing such a brute-force search requires a large amount of
computation and is thus infeasible. In the same spirit as LogitBoost [5, 11] for multi-class prediction,
we consider the second order Taylor expansion of the negative log-likelihood $l(y, x, \phi)$:

$$l(y, x, \phi + \delta) \approx l(y, x, \phi) + \delta^T G(y, x) + \frac{1}{2} \delta^T H(y, x)\, \delta. \qquad (4)$$

The gradient $G$ and Hessian $H$ in Eq. (4) are given by the following equation, where $p_i$ and $p_{ij}$ are
shorthand notations for $p_i = P(\mu_i = 1 \mid x)$ and $p_{ij} = P(\mu_i \mu_j = 1 \mid x)$:

$$G_i = p_i - \mu_i, \qquad H_{ij} = p_{ij} - p_i p_j. \qquad (5)$$

Note that Eq. (5) holds for all $i, j$ pairs, including two special cases: (1) $H_{ii} = p_i(1 - p_i)$ when
$i = j$, and (2) $H_{ij} = -p_i p_j$ when $\mu_i$ and $\mu_j$ are mutually exclusive events. Intuitively, $H_{ij}$ measures the
correlation between two events, and is nonzero due to the dependencies in the CRF model. These
dense elements of the Hessian make direct optimization of Eq. (4) still very costly. An existing
approach to functional optimization for CRFs, presented in [1], resorts to a first order approximation
of the loss, and can only guarantee convergence when the step size is small. An alternative is to
iteratively update one $\delta_i$ for $i \in \{1, \cdots, m\}$ at a time. This approach would require $m$ inference
steps per iteration, and is simply not applicable when the constraint $\mathcal{C}$ exists.
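
To make Eq. (5) concrete, the sketch below computes the gradient and the full Hessian for the node indicators of a tiny chain CRF by brute-force enumeration. The function name `node_marginal_statistics`, the score tables, and the observed labeling `y_obs` are hypothetical and chosen only for illustration; exact inference on a tree would obtain the same marginals without enumeration.

```python
import itertools
import math


def node_marginal_statistics(node_scores, edge_scores, y_obs):
    """Brute-force G and H of Eq. (5) for all node indicators of a small chain CRF."""
    n, k = len(node_scores), len(node_scores[0])
    outputs = list(itertools.product(range(k), repeat=n))
    weights = []
    for y in outputs:
        s = sum(node_scores[t][y[t]] for t in range(n))
        s += sum(edge_scores[t][y[t]][y[t + 1]] for t in range(n - 1))
        weights.append(math.exp(s))
    z = sum(weights)
    probs = [w / z for w in weights]

    # Index i corresponds to the event y_t = a.
    events = [(t, a) for t in range(n) for a in range(k)]

    def mu(i, y):
        t, a = events[i]
        return 1.0 if y[t] == a else 0.0

    m = len(events)
    p = [sum(pr * mu(i, y) for pr, y in zip(probs, outputs)) for i in range(m)]
    p_pair = [[sum(pr * mu(i, y) * mu(j, y) for pr, y in zip(probs, outputs))
               for j in range(m)] for i in range(m)]
    grad = [p[i] - mu(i, y_obs) for i in range(m)]           # G_i = p_i - mu_i
    hess = [[p_pair[i][j] - p[i] * p[j] for j in range(m)]   # H_ij = p_ij - p_i p_j
            for i in range(m)]
    return events, p, grad, hess


# Hypothetical scores for a chain with 3 nodes and 2 states per node.
node_scores = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
edge_scores = [[[1.0, -1.0], [-1.0, 1.0]], [[0.2, 0.0], [0.0, 0.2]]]
events, p, grad, hess = node_marginal_statistics(node_scores, edge_scores, (0, 0, 1))
print(hess[0][0], hess[0][1], hess[0][2])
```

With the event ordering used here, `hess[0][0]` is the diagonal case $p_i(1 - p_i)$, `hess[0][1]` is the mutually exclusive case $-p_i p_j$ (two states of the same node), and `hess[0][2]` couples two different nodes; it is nonzero because of the edge potential, which is the source of the dense Hessian.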

Our Approach In this paper, we consider an upper bound of Eq. (4) instead. The intuition behind
this approach, which will be formalized in the following sections, is as follows: each variable in the
CRF depends weakly on variables that are “far” from it. This motivates the use of a diagonal upper
bound of the Hessian to construct loss functions, given by the following lemma.

Lemma 2.1. Let $U$ be an index set of potential functions we want to update, and let $\gamma$ be a function
that satisfies the following inequality

$$\gamma_i(y, x)\, H_{ii}(y, x) \ge \sum_{j \in U} |H_{ij}(y, x)| \qquad (6)$$

Then for $\delta \in \{[\delta_1, \delta_2, \cdots, \delta_m] \mid \delta_i = 0 \text{ for } i \notin U\}$, the following inequality holds,

$$l(y, x, \phi + \delta) \le l(y, x, \phi) + \sum_{i \in U} \left[ \delta_i G_i(y, x) + \frac{1}{2} \gamma_i(y, x)\, H_{ii}\, \delta_i^2(y, x) \right] + o(\delta^2). \qquad (7)$$

The detailed proof of the lemma is given in the supplementary material. For a given $\gamma$ that satisfies the
condition, we can iteratively optimize $\tilde{L}(\phi, \delta)$, which is an upper bound of $L(\phi + \delta)$, defined by

$$\tilde{L}(\phi, \delta) = L(\phi) + \sum_{i \in U} \Big[ \sum_{(x,y) \in D} G_i(y, x)\, \delta_i(y, x) + \frac{1}{2} \sum_{(x,y) \in D} \gamma_i(y, x)\, H_{ii}\, \delta_i^2(y, x) + \Omega(\phi_i + \delta_i) - \Omega(\phi_i) \Big]. \qquad (8)$$

$\tilde{L}(\phi, \delta)$ is composed of $|U|$ independent loss functions with a regularization term, and can be used
to guide common function search procedures (such as regression tree learning). Iteratively optimizing $\tilde{L}$ will
result in a gradient boosting algorithm that ensures the convergence of $L(\phi)$ (proof in Section 4).
Furthermore, the form of $\tilde{L}$ allows the search of $\delta_i$ for $i \in U$ to be done in parallel for each equivalence
class defined by $\mathcal{C}$, which gives us further computational benefits. In the next two sections, we will
discuss how we can efficiently estimate $\gamma$ when $U$ is the index set of all node potentials, and when it
is the index set of all edge potentials, using the mixing rate of a Markov chain.
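
The quadratic part of Lemma 2.1 can be checked numerically. The sketch below is an illustration rather than the paper's proof: it takes the Hessian of a single multi-class node (the independent, LogitBoost-like case), computes the smallest $\gamma_i$ allowed by Eq. (6), and verifies on random updates that the exact quadratic term never exceeds the diagonal bound. The helper names and the probability vector are hypothetical.

```python
import random


def quadratic_form(h, delta):
    """delta^T H delta for a square matrix h given as nested lists."""
    n = len(delta)
    return sum(delta[i] * h[i][j] * delta[j] for i in range(n) for j in range(n))


def tightest_gamma(h):
    """Smallest gamma_i satisfying Eq. (6) when U indexes every row of h."""
    return [sum(abs(v) for v in row) / row[i] for i, row in enumerate(h)]


# Hessian of one multi-class node with class probabilities p:
# H_ii = p_i (1 - p_i) and H_ij = -p_i p_j for i != j.
p = [0.2, 0.3, 0.5]
h = [[(p[i] * (1 - p[i]) if i == j else -p[i] * p[j]) for j in range(3)]
     for i in range(3)]
gamma = tightest_gamma(h)

rng = random.Random(0)
violations = 0
for _ in range(1000):
    delta = [rng.uniform(-3.0, 3.0) for _ in p]
    exact = 0.5 * quadratic_form(h, delta)
    bound = 0.5 * sum(gamma[i] * h[i][i] * delta[i] ** 2 for i in range(3))
    if exact > bound + 1e-12:
        violations += 1
print(gamma, violations)
```

For this independent case every $\gamma_i$ equals 2 up to floating-point rounding, because the off-diagonal entries of each row sum in absolute value to $p_i(1 - p_i)$.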

3  Upper  Bound  Derivation  using  a  Markov  Chain  Mixing  Rate 

In this section, we will discuss how we can estimate $\gamma$ when $U$ is the index set of all node potentials,
and when it is the index set of all edge potentials. Conceptually, the choice of $\gamma$ should be related to the
interdependency of variables in the current model. When the variables in the model are independent of
each other, $\gamma$ should be small, and when the variables in the model have strong dependencies, $\gamma$
should become larger. We want to quantitatively measure the dependencies in the CRF. Specifically,
we will connect the dependency level to the mixing rate of a Markov chain defined by the conditional
distribution of outputs given the input, $P(y \mid x)$. To begin with, we re-express the right side of Eq. (6) using
the total variation distance, defined by $\|P - Q\|_{tv} = \frac{1}{2} \sum_x |P(x) - Q(x)|$.

Lemma 3.1. Let $U$ correspond to the set of all node potentials, $U = \{j \mid \mu_j \in \mathcal{N}\}$, and assume index $i$
corresponds to the event $y_t = k$ (i.e. $\mu_i = 1(y_t = k)$); then

$$\sum_{j \in U} |H_{ij}| = 2 p_i \sum_{s} \| P(y_s \mid x, \mu_i = 1) - P(y_s \mid x) \|_{tv}. \qquad (9)$$

Lemma 3.2. Let $U$ correspond to the set of all edge potentials, $U = \{j \mid \mu_j \in \mathcal{E}\}$; then

$$\sum_{j \in U} |H_{ij}| = 2 p_i \sum_{(s,v) \in \mathcal{E}} \| P(y_s, y_v \mid x, \mu_i = 1) - P(y_s, y_v \mid x) \|_{tv}.$$

Note that we abuse the notation slightly here, by using $\mathcal{E}$ to indicate the index set of edges in the CRF.
The proof is a re-arrangement of terms, using $H_{ij} = p_i [P(\mu_j = 1 \mid \mu_i = 1, x) - p_j]$, and is provided in
the supplementary material. Intuitively,
the total variation terms in Lemmas 3.1 and 3.2 measure how dependent $y_s$ is on the event $y_t = k$.
When $y_s$ is only weakly dependent on $y_t$, the distance will be small. The complexity of calculating
Eq. (9) for all $i$ is quadratic in the number of nodes, which is too expensive to be calculated directly
for most applications. We need an algorithm that scales linearly with the number of nodes.

Total variation distance allows us to approach the problem in terms of dependencies between
variables. Intuitively, we expect the dependencies between $y_s$ and $y_t$ to become smaller as $s$
moves away from $t$. We formally state this in the following theorem:

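Theorem 3.1. Mixing rate bound for Markov chain in CRF

Assume $y_t$, $y_s$ and $y_v$ form a Markov chain $y_t \to y_s \to y_v$, conditioned on $x$, i.e. $P(y_v \mid y_s, y_t, x) =
P(y_v \mid y_s, x)$ holds. Define $d(s, t, k) = \| P(y_s \mid x, y_t = k) - P(y_s \mid x) \|_{tv}$. Then, the total variation
$d(v, t, k)$ can be bounded by

$$d(v, t, k) \le \Big[ 1 - \sum_{j} \min_{i} P(y_v = j \mid y_s = i, x) \Big]\, d(s, t, k) = \alpha_{s,v}\, d(s, t, k). \qquad (10)$$

Proof. Define the notation $M_{ij} = P(y_v = j \mid y_s = i, x)$ and $Q_j = \min_i M_{ij}$; then

$$2 d(v, t, k) = \sum_{j} \big| P(y_v = j \mid y_t = k, x) - P(y_v = j \mid x) \big|$$

$$= \sum_{j} \Big| \sum_{i} M_{ij} P(y_s = i \mid y_t = k, x) - \sum_{i} M_{ij} P(y_s = i \mid x) \Big|$$

$$= \sum_{j} \Big| \sum_{i} (M_{ij} - Q_j) \big[ P(y_s = i \mid y_t = k, x) - P(y_s = i \mid x) \big] \Big|$$

$$\le \sum_{j} \sum_{i} (M_{ij} - Q_j) \big| P(y_s = i \mid y_t = k, x) - P(y_s = i \mid x) \big|$$

$$= \sum_{i} \Big( 1 - \sum_{j} Q_j \Big) \big| P(y_s = i \mid y_t = k, x) - P(y_s = i \mid x) \big| = 2 \alpha_{s,v}\, d(s, t, k). \quad \square$$

The third line subtracts $Q_j$ at no cost, since $Q_j$ does not depend on $i$ and both distributions over $y_s$
sum to one; the fourth line uses $M_{ij} - Q_j \ge 0$.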

The derivation of Theorem 3.1 is inspired, in spirit, by the mixing rate bounds of time homogeneous
Markov chains [10]^1. Intuitively, Theorem 3.1 shows the dependency decays exponentially as $s$
moves away from $t$. The following corollary holds as a direct consequence of the theorem.
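
The contraction in Eq. (10) can be checked on a small example. In the sketch below, the transition matrix `m` and the two distributions over $y_s$ are hypothetical numbers standing in for $P(y_v \mid y_s, x)$, $P(y_s \mid x, y_t = k)$ and $P(y_s \mid x)$; the inequality itself holds for any pair of distributions pushed through the same transition matrix.

```python
def total_variation(a, b):
    """Total variation distance: half the L1 distance between two distributions."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


def push_forward(dist, m):
    """Distribution of y_v from a distribution over y_s and M_ij = P(y_v = j | y_s = i, x)."""
    return [sum(dist[i] * m[i][j] for i in range(len(dist)))
            for j in range(len(m[0]))]


def mixing_rate(m):
    """alpha_{s,v} = 1 - sum_j min_i M_ij, as in Eq. (10)."""
    return 1.0 - sum(min(row[j] for row in m) for j in range(len(m[0])))


m = [[0.7, 0.2, 0.1],
     [0.3, 0.5, 0.2],
     [0.2, 0.2, 0.6]]
cond = [0.9, 0.05, 0.05]   # stands in for P(y_s | x, y_t = k)
marg = [0.4, 0.35, 0.25]   # stands in for P(y_s | x)

alpha = mixing_rate(m)
d_s = total_variation(cond, marg)
d_v = total_variation(push_forward(cond, m), push_forward(marg, m))
print(alpha, d_s, d_v, d_v <= alpha * d_s)
```

Here the column minima of `m` sum to 0.5, so the distance after one step is at most half the distance before it.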

Corollary 3.1. Let $q = [q(1), q(2), \cdots, q(n)]$ be the path sequence in $\mathcal{E}$ from $t$ to $s$ (i.e. $q(1) =
t$, $q(n) = s$); then we can bound $d(s, t, k)$ using $d(t, t, k)$ times the decay ratio $\alpha$ along the path,

$$d(s, t, k) \le \prod_{i=1}^{n-1} \alpha_{q(i), q(i+1)}\, d(t, t, k). \qquad (11)$$

In the case when $\mathcal{E}$ is a chain, Corollary 3.1 simplifies to $d(s, t, k) \le \prod_{h=t}^{s-1} \alpha_{h,h+1}\, d(t, t, k)$ when
$s > t$, and $d(s, t, k) \le \prod_{h=s+1}^{t} \alpha_{h,h-1}\, d(t, t, k)$ when $s < t$. An important property of Theorem 3.1
is that the position-specific mixing rate $\alpha$ can be computed efficiently (complexity analysis in
Sec. 4). We still need to calculate $d(t, t, k)$, which is given by the following lemma.

Lemma 3.3. Let $M$ correspond to an index set of $\mu_i$ such that the $\mu_i$ are mutually exclusive (i.e.
$\mu_i \mu_j = 0$ for $i \ne j$, $i, j \in M$), and $\sum_{j \in M} P(\mu_j = 1 \mid x) = 1$. Then the following identity holds

$$\frac{1}{2} \sum_{j \in M} \big| P(\mu_j = 1 \mid \mu_i = 1, x) - P(\mu_j = 1 \mid x) \big| = 1 - P(\mu_i = 1 \mid x) \qquad (12)$$

The proof is given in the supplementary material. From Lemma 3.3, it follows that $d(t, t, k) = 1 - p_i$.
We will make use of Lemma 3.3 and Corollary 3.1 to efficiently estimate $\gamma$ in the next section.

4  Gradient  Boosting  for  CRF 

In this section, we will present our gradient boosting algorithm. We first give estimates of $\gamma$ for $U$
being the index set of all node potentials and of all edge potentials, in the following two theorems.

^1 Note that our proof is actually for a time inhomogeneous Markov chain.

Algorithm 1 Gradient Boosting for CRF

repeat
    for $U \in \{\mathcal{N}, \mathcal{E}\}$ do
        for $(y, x) \in D$ in parallel do
            {inference of $p_i$, $\alpha$ is done using dynamic programming}
            Infer $G_i(y, x) \leftarrow p_i - \mu_i(y)$, $H_{ii}(y, x) \leftarrow p_i(1 - p_i)$ for each $i \in U$
            Infer $\gamma_i(y, x)$ using Theorems 4.1 and 4.2 for each $i \in U$
        end for
        for $[c] \subseteq U$ in parallel do
            {We use $[c]$ to enumerate over the sets of equivalent indices defined by $\mathcal{C}$ in $U$}
            $\delta_c \leftarrow \arg\min_{\delta \in \mathcal{F}} \Omega(\phi_c + \delta) + \sum_{(y,x) \in D} \sum_{i \in [c]} \left[ G_i(y, x)\, \delta(y, x) + \frac{1}{2} \gamma_i(y, x)\, H_{ii}(y, x)\, \delta^2(y, x) \right]$
            $\phi_c \leftarrow \phi_c + \epsilon\, \delta_c$
        end for
    end for
until convergence


Theorem 4.1. Let $U$ be the index set of all node potentials, assume $\mu_i = 1(y_t = k)$, and define $Q_t$
to be the set of all paths that start from $t$; then

$$\gamma_i(y, x) = 2 \Big( 1 + \sum_{q \in Q_t} \sum_{s=2}^{len(q)} \prod_{i=1}^{s-1} \alpha_{q(i), q(i+1)} \Big) \qquad (13)$$

satisfies Eq. (6). Here $len(q)$ is the length of the path $q$, and $\alpha$ is defined in Theorem 3.1.

Theorem 4.2. Let $U$ be the index set of all edge potentials, assume $\mu_i = 1(y_t = k_1, y_{t+1} = k_2)$,
and define $Q_{t,t+1}$ to be the set of all paths that start from $t$ or $t+1$ and do not cross $(t, t+1)$; then

$$\gamma_i(y, x) = 2 \Big( 1 + \sum_{q \in Q_{t,t+1}} \Big( 1 + \sum_{s=2}^{len(q)} \prod_{i=1}^{s-1} \alpha_{q(i), q(i+1)} \Big) \Big) \qquad (14)$$

satisfies Eq. (6), with the same definition of $len(q)$ and $\alpha$ as in Theorem 4.1.

Both theorems can be proved by using Corollary 3.1 and Lemma 3.3 to bound the total variation
distance. We give the detailed proof in the supplementary material. Based on Theorems 4.1 and 4.2, we
obtain an efficient gradient boosting algorithm for CRFs (GBCRF), which is presented in Algorithm 1.
Here $\epsilon$ is a shrinkage term used to control overfitting. Our algorithm adaptively estimates $\gamma$ via the
mixing rate calculation at each iteration. At the beginning of training, when the node variables are
independent of each other, we will have a $\gamma$ that is close to 2 (and thus the updates are aggressive).
$\gamma$ increases as the variables become dependent on each other (inducing more conservative updates).
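
To illustrate how $\gamma$ controls the conservativeness of an update, consider one leaf of a regression tree that outputs a constant value $w$ for the data points routed to it. The sketch below assumes, purely for illustration, an L2 penalty $\frac{1}{2}\lambda w^2$ on the leaf value (the specific form of $\Omega$ for trees is not given in this section); under that assumption the per-leaf objective from Algorithm 1 is a one-dimensional quadratic with a closed-form minimizer. The function names and the numbers are hypothetical.

```python
def leaf_objective(w, grads, hess_diag, gammas, l2=1.0):
    """Quadratic bound for one leaf: sum_i [G_i w + 0.5 gamma_i H_ii w^2] + 0.5 l2 w^2."""
    curvature = sum(g * h for g, h in zip(gammas, hess_diag))
    return sum(grads) * w + 0.5 * curvature * w * w + 0.5 * l2 * w * w


def leaf_weight(grads, hess_diag, gammas, l2=1.0):
    """Closed-form minimizer of leaf_objective (assumes the L2 penalty above)."""
    curvature = sum(g * h for g, h in zip(gammas, hess_diag))
    return -sum(grads) / (curvature + l2)


grads = [0.3, -0.7, 0.2, -0.4]          # G_i of the points in the leaf
hess_diag = [0.21, 0.21, 0.16, 0.24]    # H_ii = p_i (1 - p_i)

w_indep = leaf_weight(grads, hess_diag, [2.0] * 4)   # gamma = 2: independent case
w_dep = leaf_weight(grads, hess_diag, [6.0] * 4)     # larger gamma: stronger dependencies
print(w_indep, w_dep)
```

A larger $\gamma$ increases the curvature term and therefore shrinks the step, which matches the description of more conservative updates as dependencies grow.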

The calculation of $\gamma$ can be performed using dynamic programming. To explain the idea more
clearly, let us consider the case when $\mathcal{E}$ is a chain. In this case, Eq. (13) specializes into a calculation
of $\beta_t^{+} = \sum_{s=t+1}^{n} \prod_{h=t}^{s-1} \alpha_{h,h+1}$ and $\beta_t^{-} = \sum_{s=1}^{t-1} \prod_{h=s+1}^{t} \alpha_{h,h-1}$, so that
$\gamma_i = 2(1 + \beta_t^{+} + \beta_t^{-})$, and both can be calculated efficiently
using the following recursive formula

$$\beta_t^{+} = \alpha_{t,t+1} (1 + \beta_{t+1}^{+}); \qquad \beta_t^{-} = \alpha_{t,t-1} (1 + \beta_{t-1}^{-}). \qquad (15)$$

Similarly, we can use dynamic programming for any tree-shaped $\mathcal{E}$ (using up-down recursion). A
direct consequence of Theorem 4.1 is that we can bound the loss using an estimate based on the number of
nodes in the CRF, though this bound is usually worse than the bound using the mixing rate.
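
The chain recursion in Eq. (15) can be sketched as follows. The code uses 0-based positions and takes the mixing rates as two lists, `alpha_fwd[h]` for $\alpha_{h,h+1}$ and `alpha_bwd[h]` for $\alpha_{h+1,h}$; this input layout and the function name `chain_gammas` are assumptions of the example, which is meant as a minimal illustration of the linear-time computation rather than the authors' implementation.

```python
def chain_gammas(alpha_fwd, alpha_bwd):
    """gamma for every position of a chain, using the recursion of Eq. (15).

    alpha_fwd[h] stands for alpha_{h,h+1} and alpha_bwd[h] for alpha_{h+1,h},
    with 0-based positions h = 0 .. n-2.
    """
    n = len(alpha_fwd) + 1
    # beta_plus[t]: accumulated decay towards the right end of the chain.
    beta_plus = [0.0] * n
    for t in range(n - 2, -1, -1):
        beta_plus[t] = alpha_fwd[t] * (1.0 + beta_plus[t + 1])
    # beta_minus[t]: accumulated decay towards the left end of the chain.
    beta_minus = [0.0] * n
    for t in range(1, n):
        beta_minus[t] = alpha_bwd[t - 1] * (1.0 + beta_minus[t - 1])
    return [2.0 * (1.0 + beta_plus[t] + beta_minus[t]) for t in range(n)]


independent = chain_gammas([0.0] * 4, [0.0] * 4)   # no dependency between neighbours
worst_case = chain_gammas([1.0] * 4, [1.0] * 4)    # no decay along the chain
mixed = chain_gammas([0.5, 0.2, 0.9, 0.1], [0.4, 0.3, 0.8, 0.2])
print(independent, worst_case, mixed)
```

With all mixing rates equal to zero every $\gamma$ is 2, and with all mixing rates equal to one every $\gamma$ is $2n$ (here $n = 5$), the length-based estimate; intermediate rates fall between these two extremes.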

Corollary 4.1. When $U$ is the index set of node potentials, $\gamma_i = 2n$ satisfies Eq. (6), where $n$ is
the number of nodes in the CRF.

Relation to LogitBoost Our algorithm can be viewed as a generalization of multi-class classification
using LogitBoost [5]. When the variables in each position are independent (no edge potentials), the
estimate of $\gamma$ equals 2, and our algorithm becomes identical to LogitBoost. When the variables
are dependent on each other, which is common in structured prediction, our model estimates the
dependency level via the Markov chain mixing rate to guide the boosting objective in each iteration.

Time Complexity The time complexity for the gradient boosting statistics collection in Algorithm 1
is $O(|D| n K^2)$, where $K$ is the number of states in each node and $n$ is the average number of
nodes (e.g. length of sequence) in each instance. This is due to the fact that the estimation of $\gamma$ can
be done in $O(|D| n K^2)$ time, using a dynamic programming algorithm. This complexity is the
same as the complexity of traditional training methods for linear CRFs. The time complexity of the
entire algorithm is $O(|D| n K^2 + g(|D|, n))$, where $g(|D|, n)$ is the cost of function learning given the
statistics. For learning trees, the complexity of function learning is usually $O(|D| n \log(|D| n))$.
Thus our approach extends CRFs to non-linearity with only an additional log factor.

Convergence Analysis In this section, we analyze the convergence of our algorithm. The advantage
of our method is that it makes use of second order information, and guarantees convergence.

Theorem 4.3. $L(\phi)$ converges with the procedure described by Algorithm 1 for $\epsilon \le 1$.

Proof. During each iteration, assume $\delta^*$ is the function that optimizes $\tilde{L}(\phi, \delta)$ defined in Eq. (8); then

$$L(\phi + \epsilon \delta^*) \le \tilde{L}(\phi, \epsilon \delta^*) \le \tilde{L}(\phi, 0) = L(\phi). \qquad (16)$$

Thus the loss function $L$ decreases after each boosting step, and the algorithm converges to a
minimum (possibly a local minimum when $\mathcal{F}$ is nonlinear) of $L$. $\square$

5  Related  Work 

Conditional random fields [9] are among the most successful solutions to structured prediction
problems. Variants of conditional random fields have been proposed and widely applied for structured
prediction in domains such as natural language processing [9, 16], computer vision [7, 15] and
bioinformatics [21]. Most popular instantiations assume linear potential functions and improve
performance by carefully engineering features. Our work focuses on learning probabilistic models
for tree-shaped CRFs with nonlinear potential functions. When there are loops in the CRF and
inference is intractable, the objective can be relaxed to use approximate inference and
learning [6, 13, 3]. A similar dependency-based term is also used in approximate inference [13, 3],
but is usually set to a constant value across all instances and training iterations. As future work,
it would be interesting to explore whether our adaptive Markov chain mixing rate bound can be
applied to this more general setting.

Gradient boosting [4], which performs additive optimization in the functional space, has been
successfully applied to classification problems that assume independent outputs conditioned on the
input [5]. Most existing attempts to “boost” CRF models optimize approximate objectives [20, 12].
The TreeCRF algorithm [1] is similar to our approach in that it directly optimizes the log-likelihood
function defined using non-linear potential functions; however, it only takes first order information into
account during optimization, requiring a decreasing step size. In contrast, our method makes
use of second order information, and guarantees convergence with a fixed step size. Our method can
also be viewed as a generalization of LogitBoost [5] to CRFs. It is worth noting that the recent
improvements of LogitBoost, which make use of adaptive base functions [11, 17], can also potentially
be combined with our method to make further improvements.

6  Experiments 

In this section, we evaluate our method on named entity recognition, handwritten character
recognition, and protein secondary structure prediction. We compare the following methods: (1) GBCRF is
the method proposed in this paper. We set $\mathcal{F}_N$ to be a set of regression trees, and $\mathcal{F}_E$ to be linear
functions of basic transition features between states; (2) LogitBoost is a gradient boosting method for
multi-class classification [5] that does not model the dependencies between outputs; (3) TreeCRF
is a gradient boosting method that only uses first-order information [1]. We use the same family of
edge and node potentials as GBCRF; (4) Linear CRF is the standard CRF model with linear edge
and node potentials [9]. For all the methods, the training parameters are selected using a validation
set or cross validation, depending on the specific setup of each dataset.

Named Entity Recognition We first test our methods on the natural language task of named entity
recognition (NER) using the CoNLL-2003 shared task benchmark dataset [19]. The dataset contains
around 20K sentences, and defines a standard split into 14K sentences as training set, 3.3K as validation
set (also called development set), and 3.5K sentences as test set. Traditional approaches for NER
involve a lot of time-consuming feature engineering that requires domain expertise, and build a
Linear CRF over these features. Instead, in our experiment, we explore whether it is possible to
perform minimal feature engineering, and use a representation learned from data for prediction.

Table 1: F1 Measure of Named Entity Recognition on CoNLL-2003 Dataset. We use subscript val to
denote validation set, and subscript test to denote test set.

                 Word Embedding only      Word Features + Embedding
Method           F1_val      F1_test      F1_val      F1_test
Linear CRF       0.8452      0.7943       0.8952      0.8475
LogitBoost       0.8532      0.7887       0.8717      0.8197
TreeCRF          0.8630      0.8060       0.8846      0.8399
GBCRF            0.8801      0.8269       0.9015      0.8635

Table 2: Cross Validation Error on Handwritten Character Recognition Dataset.

Method                      Error
Linear CRF                  0.1292 ± 0.0080
LogitBoost                  0.0967 ± 0.0049
TreeCRF                     0.0699 ± 0.0040
GBCRF                       0.0464 ± 0.0027
NeuroCRF (Do et al. [2])    0.0444

Table 3: Predictive Q8 Accuracy on Protein Secondary Structure Dataset.

Method                              Accuracy
Linear CRF                          0.614
LogitBoost                          0.710
TreeCRF                             0.718
GBCRF                               0.722
SC-GSN (Zhou et al. [22])           0.711
SC-GSN with “kick-start” ([22])     0.721

Specifically, we take the word embedding vectors from Mikolov et al. [14], which are learned from the
Google News corpus, and train the models on this representation. In this setting, each word is
represented by a 300 dimensional vector that captures the “semantics” of the word. For each position
in the sentence, we take the embedding vectors of the previous, current, and next word as input to the node
potential function. We call this setting “word embedding only”. We further perform minimal feature
engineering to generate only the unigram features (word, POS tag and case pattern of the current word).
We use these basic features to train a weak linear model, then use additive training to boost the base
model using the word embedding representation. We call this setting “word features + embedding”.

The results of token-wise F1 evaluation for these models are shown in Table 1. From the results, we
see that GBCRF works better than Linear CRF in both settings. The gap between LogitBoost and
GBCRF indicates the importance of introducing edge potentials to this problem. The gap between
TreeCRF and GBCRF also shows that taking second order information into account helps us obtain
a more accurate model.

Handwriting Character Recognition We also evaluate our method on a handwriting recognition
dataset^2. The dataset consists of 6877 words, corresponding to about 52 thousand handwritten
characters [8, 18], each represented by a binary pixel vector of 128 dimensions and belonging to
one of the 26 letters of the alphabet. The dataset is randomly split into 10 folds for cross validation. We train the
models on 9 folds, test on 1 fold, and use the cross validation error to compare the methods.

The experimental results are shown in Table 2. Both our method and TreeCRF outperform the CRF
with linear potential functions, which indicates the effectiveness of introducing non-linear potential
functions into the CRF on this dataset. The gap between LogitBoost and the models that consider
dependencies indicates the importance of incorporating structure information of the outputs into the
model. Our results are also comparable to NeuroCRF [2], which uses a deep neural network as a
potential function whose weights are initialized by Restricted Boltzmann Machines.

Protein Secondary Structure Prediction We also conduct an experiment on protein secondary
structure prediction. The task is to predict 8-state secondary structure labels for a given amino-acid
sequence of a protein. We use the protein secondary structure dataset recently introduced by Zhou
et al. [22], which is the largest publicly available protein secondary structure prediction dataset. The
dataset contains 6128 proteins, with an average sequence length of around 208. We use exactly the same
features and data split as [22]. The resulting dataset contains 5600 sequences as training set,
256 sequences as validation set and 272 sequences as test set. Each position of the protein sequence
has a 46-dimensional feature vector (22 for PSSM, 22 for sequence and 2 for terminals) for prediction.


^2 http://www.seas.upenn.edu/~taskar/ocr/

[Figure 1 panels: (a) Evolution of negative log-likelihood; (b) Evolution of average $\gamma$ estimation]

Figure 1: Convergence of GBCRF on the handwritten character dataset. (a) Convergence comparison
between GBCRF and TreeCRF, with the shrinkage rate of both algorithms set to 1. GBCRF converges
faster than TreeCRF; (b) Evolution of different $\gamma$ estimations on the CRF model in each round based
on 500 sequences. Mixing rate based estimation has the same trend as brute force estimation, and
provides a tighter estimation than the length based estimation.

To  train  the  models,  we  take  the  concatenation  of  feature  vectors  within  3  positions  of  the  target 
position  as  input  to  the  node  potential,  resulting  in  322  input  features  in  each  position. 
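
The window construction described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `window_features` is hypothetical, and zero-padding at the sequence boundaries is an assumption, since the text does not say how the ends are handled. It shows why a radius of 3 over 46-dimensional features yields 7 x 46 = 322 inputs per position.

```python
from typing import List


def window_features(seq: List[List[float]], radius: int = 3) -> List[List[float]]:
    """Concatenate the feature vectors within `radius` positions of each target.

    Positions that fall outside the sequence are zero-padded (an assumption).
    """
    dim = len(seq[0])
    pad = [0.0] * dim
    out = []
    for t in range(len(seq)):
        row: List[float] = []
        # Visit positions t-radius, ..., t, ..., t+radius in order.
        for offset in range(-radius, radius + 1):
            pos = t + offset
            row.extend(seq[pos] if 0 <= pos < len(seq) else pad)
        out.append(row)
    return out


# Toy sequence of 10 positions with 46 features each.
seq = [[float(t)] * 46 for t in range(10)]
windows = window_features(seq)
print(len(windows), len(windows[0]))  # 10 322
```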

The performance of the model is measured by the accuracy of predictions on the test set (denoted as
Q8). We train the models with parameters selected on the validation set, and report the results
in Table 3. From the table, we find that using trees as potential functions leads to better performance
than restricting the model to linear functions. Our results are comparable to the state-of-the-art
result on this dataset, produced by Zhou et al. [22] (SC-GSN-3layer). That result is generated
by a deep convolutional generative stochastic network model for secondary structure label
prediction, optimized with a “kick-start” initialization scheme.

Convergence of GBCRF We further analyze the convergence of GBCRF on the handwritten character
dataset. We plot the convergence of the negative log-likelihood of GBCRF and TreeCRF
in Fig. 1(a). We find that GBCRF converges faster than TreeCRF, demonstrating that taking
second-order information into account not only gives a theoretical guarantee of convergence, but also helps
the method converge faster in practice.

We also investigate the tightness of the $\gamma$ estimation. Figure 1(b) gives the average of different
$\gamma$ estimations on models trained by GBCRF in each round. Mixing rate based estimation is the method
proposed in this paper. Brute force estimation computes $\gamma$ exactly using Eq. (9);
its complexity is quadratic in the number of nodes and outputs in the CRF, and
hence it cannot be used for most real-world sequences. Average length based estimation is a naive
estimation using 2 times the number of nodes in the CRF; it provides a valid estimation of $\gamma$ since it
upper bounds Eq. (13), as we show in Corollary 4.1. We restrict this evaluation to the shortest 500
sequences, due to the computation cost of brute force estimation. From the figure, we find that
mixing rate based estimation exhibits the same trend as the brute force estimation, and is at most 2.3
times higher than the brute force estimation. Further, the mixing rate based bound is consistently
lower than the fixed bound computed by the length based estimation. These results indicate that our
mixing rate based estimation correctly captures the changes in the dependencies in the model during
training. Hence our proposed mixing rate based approach is indeed useful for estimating $\gamma$ efficiently.

7  Conclusion 

In this paper, we present a novel gradient boosting algorithm for CRFs. It is non-trivial to design an
effective gradient boosting algorithm for CRFs, mainly due to the dense Hessian matrices introduced by variable
interdependency. To solve the problem, we make use of a Markov Chain mixing rate to derive an
efficiently computable adaptive upper bound of the loss function, and construct a gradient boosting
algorithm that iteratively optimizes the bound. The resulting algorithm can be viewed as a
generalization of LogitBoost to CRFs, thus introducing non-linearity in CRFs at only a log factor cost.
Experimental results demonstrate that our method is both efficient and effective. As future work, it
would be interesting to explore whether we can generalize the result to loopy models.
Supplementary Material
A  Proof  for  Lemma  2.1 

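Proof. The following inequality holds for any $\gamma$ that satisfies the condition of the lemma:

$$
\sum_{i \in U} \gamma_i H_{ii} \delta_i^2
\ge \sum_{i \in U} \sum_{j \in U} |H_{ij}| \delta_i^2
= \frac{1}{2} \sum_{i \in U} \sum_{j \in U} |H_{ij}| (\delta_i^2 + \delta_j^2)
\ge \sum_{i \in U} \sum_{j \in U} |H_{ij}| |\delta_i \delta_j|
$$

Applying it to the Taylor expansion in Eq. (4), we have

$$
\begin{aligned}
l(y, x, \phi + \delta) &= l(y, x, \phi) + \sum_{i \in U} \delta_i G_i(y, x) + \frac{1}{2} \sum_{i \in U} \sum_{j \in U} \delta_i H_{ij} \delta_j + o(\delta^2) \\
&\le l(y, x, \phi) + \sum_{i \in U} \delta_i G_i(y, x) + \frac{1}{2} \sum_{i \in U} \gamma_i H_{ii} \delta_i^2 + o(\delta^2)
\end{aligned}
$$

□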

B  Proof  for  Lemma  3.3 

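Proof. Using the fact that $\mu_i$ and $\mu_j$ are mutually exclusive for $j \ne i$, we have

$$
\begin{aligned}
\sum_{j \in M} |P(\mu_j = 1|\mu_i = 1, x) - P(\mu_j = 1|x)|
&= |P(\mu_i = 1|\mu_i = 1, x) - P(\mu_i = 1|x)| + \sum_{j \ne i} |P(\mu_j = 1|\mu_i = 1, x) - P(\mu_j = 1|x)| \\
&= |1 - P(\mu_i = 1|x)| + \sum_{j \ne i} |0 - P(\mu_j = 1|x)| \\
&= (1 - P(\mu_i = 1|x)) + \sum_{j \ne i} P(\mu_j = 1|x) \\
&= 2 (1 - P(\mu_i = 1|x))
\end{aligned}
$$

The last step uses $\sum_{j \in M} P(\mu_j = 1|x) = 1$.

□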
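
The identity in Lemma 3.3 can be checked numerically on a toy example. The sketch below assumes the indicators in $M$ are mutually exclusive and exhaustive (for instance, the states of a single node), so that conditioning on one indicator sets it to 1 and all others to 0. The function name `indicator_shift` and the probabilities are made up for illustration.

```python
def indicator_shift(probs, i):
    """Sum over j of |P(mu_j = 1 | mu_i = 1) - P(mu_j = 1)| for one-hot indicators."""
    total = 0.0
    for j, p_j in enumerate(probs):
        # Mutual exclusivity: given mu_i = 1, mu_j is 1 only when j == i.
        conditional = 1.0 if j == i else 0.0
        total += abs(conditional - p_j)
    return total


# Toy marginals over four mutually exclusive, exhaustive states.
probs = [0.5, 0.25, 0.125, 0.125]
for i, p_i in enumerate(probs):
    # Lemma 3.3: the sum equals 2 * (1 - p_i).
    assert abs(indicator_shift(probs, i) - 2 * (1 - p_i)) < 1e-12
```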


C  Proof  for  Lemma  3.1  and  3.2 


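Proof. The proof is exactly the same for the node and edge potential cases, so we present it here for $U$
being the set of all node potentials. Recall the definition of $H$: $H_{ij} = p_{ij} - p_i p_j$. Here $p_i$ and $p_{ij}$ are
shorthand for $p_i = P(\mu_i = 1|x)$ and $p_{ij} = P(\mu_i \mu_j = 1|x)$. With $\mu_i$ the indicator of $y_t = k$, we have

$$
\begin{aligned}
\frac{1}{2 p_i} \sum_{j \in U} |H_{ij}|
&= \frac{1}{2} \sum_{j \in U} \left| \frac{p_{ij}}{p_i} - p_j \right| \\
&= \frac{1}{2} \sum_{j \in U} |P(\mu_j = 1|\mu_i = 1, x) - P(\mu_j = 1|x)| \\
&= \frac{1}{2} \sum_{s, k'} |P(y_s = k'|y_t = k, x) - P(y_s = k'|x)| \\
&= \sum_{s} \|P(y_s|x, y_t = k) - P(y_s|x)\|_{tv}
\end{aligned}
$$

□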
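
As a sanity check on the relation between the Hessian row sum and the total variation distances, the following sketch enumerates a tiny chain and compares the two sides. It assumes $H_{ij} = p_{ij} - p_i p_j$ (consistent with $H_{ii} = p_i(1 - p_i)$) and takes total variation to be half the L1 distance. The potentials are random toy values, and all function names are hypothetical.

```python
import itertools
import random


def chain_joint(n_nodes, n_states, seed=0):
    """Joint distribution of a chain with random positive potentials (toy data)."""
    rng = random.Random(seed)
    node = [[rng.uniform(0.5, 2.0) for _ in range(n_states)] for _ in range(n_nodes)]
    edge = [[[rng.uniform(0.5, 2.0) for _ in range(n_states)]
             for _ in range(n_states)] for _ in range(n_nodes - 1)]
    weights = {}
    for y in itertools.product(range(n_states), repeat=n_nodes):
        w = 1.0
        for s in range(n_nodes):
            w *= node[s][y[s]]
        for s in range(n_nodes - 1):
            w *= edge[s][y[s]][y[s + 1]]
        weights[y] = w
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}


def prob(joint, fixed):
    """Probability that y_s = k for every (s, k) pair in `fixed`."""
    return sum(p for y, p in joint.items() if all(y[s] == k for s, k in fixed))


def lemma_3_1_sides(joint, n_nodes, n_states, t, k):
    """Return (sum_j |H_ij|, 2 p_i sum_s TV) for the indicator of y_t = k."""
    p_i = prob(joint, [(t, k)])
    hessian_row_sum = 0.0
    tv_sum = 0.0
    for s in range(n_nodes):
        for k2 in range(n_states):
            p_j = prob(joint, [(s, k2)])
            p_ij = prob(joint, [(t, k), (s, k2)])
            hessian_row_sum += abs(p_ij - p_i * p_j)
            # Half the L1 distance between P(y_s | y_t = k) and P(y_s).
            tv_sum += 0.5 * abs(p_ij / p_i - p_j)
    return hessian_row_sum, 2.0 * p_i * tv_sum


N_NODES, N_STATES = 3, 3
joint = chain_joint(N_NODES, N_STATES)
for t in range(N_NODES):
    for k in range(N_STATES):
        lhs, rhs = lemma_3_1_sides(joint, N_NODES, N_STATES, t, k)
        assert abs(lhs - rhs) < 1e-12
```

Enumerating all configurations is exponential in the number of nodes, so this only serves as a check on small chains.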
D Proof for Theorem 4.1 and 4.2

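Proof. Proof for Theorem 4.1. We want to bound the total variation distance given by
Eq. (9) in Lemma 3.1:

$$
\begin{aligned}
2 p_i \sum_{s} \|P(y_s|x, y_t = k) - P(y_s|x)\|_{tv}
&= 2 p_i \sum_{s} d(s, t, k) \\
&= 2 p_i \Big[ d(t, t, k) + \sum_{q \in Q_t} \sum_{s=2}^{len(q)} d(q(s), t, k) \Big] \\
&\le 2 p_i \Big[ d(t, t, k) + \sum_{q \in Q_t} \sum_{s=2}^{len(q)} d(t, t, k) \prod_{j=1}^{s-1} \alpha_{q(j), q(j+1)} \Big] \\
&= 2 p_i (1 - p_i) \Big[ 1 + \sum_{q \in Q_t} \sum_{s=2}^{len(q)} \prod_{j=1}^{s-1} \alpha_{q(j), q(j+1)} \Big]
\end{aligned}
$$

Here the inequality is given by Corollary 3.1 ($d(q(s), t, k) \le d(t, t, k) \prod_{j=1}^{s-1} \alpha_{q(j), q(j+1)}$), and the last
equality is given by Lemma 3.3 ($d(t, t, k) = 1 - p_i$). Recalling that $H_{ii} = p_i (1 - p_i)$, we have proved
Theorem 4.1. □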
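
To make the shape of the final bound concrete, the sketch below evaluates it for given per-edge contraction coefficients. It assumes a chain in which $Q_t$ consists of the two paths running from node $t$ to either end; the coefficient values are made up, and the names `alpha`, `path_bound_factor` and `theorem_4_1_bound` are hypothetical. How the coefficients themselves are obtained is not covered here.

```python
def path_bound_factor(paths_alpha):
    """Bracketed factor: 1 + sum over paths of cumulative products of coefficients.

    paths_alpha holds, for each path q starting at node t, the list of
    per-edge coefficients along that path, in order away from t.
    """
    total = 1.0
    for alphas in paths_alpha:
        running = 1.0
        for alpha in alphas:
            running *= alpha  # product over the edges walked so far
            total += running
    return total


def theorem_4_1_bound(p_i, paths_alpha):
    """Upper bound 2 * p_i * (1 - p_i) * [bracketed factor]."""
    return 2.0 * p_i * (1.0 - p_i) * path_bound_factor(paths_alpha)


# Node t with 2 nodes on its left and 3 on its right, every coefficient 0.5.
paths = [[0.5, 0.5], [0.5, 0.5, 0.5]]
print(path_bound_factor(paths))       # 2.625
print(theorem_4_1_bound(0.5, paths))  # 1.3125

# With every coefficient equal to 1 the factor is just the node count (6 here).
print(path_bound_factor([[1.0, 1.0], [1.0, 1.0, 1.0]]))  # 6.0
```

In the last case, dividing the bound by $p_i(1 - p_i)$ gives twice the number of nodes, which appears consistent with the length based estimation described in the experiments; coefficients below 1 shrink the factor geometrically along each path.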

Proof. Proof for Theorem 4.2. In this proof, we reduce the total variation distance between
joint distributions of edge states to the total variation distance between marginal distributions over nodes, as
in Theorem 4.1. Assume the two edges are $(y_t, y_{t+1})$ and $(y_s, y_{s+1})$, and that $y_s$ is closer to $y_{t+1}$ (without
loss of generality). Then

$$
P(y_s, y_{s+1}|y_t, y_{t+1}, x) = P(y_{s+1}|y_s, x) P(y_s|y_{t+1}, x)
$$

We can convert the total variation distance by

$$
\begin{aligned}
\|P(y_s, y_{s+1}|y_t, y_{t+1}, x) - P(y_s, y_{s+1}|x)\|_{tv}
&= \frac{1}{2} \sum_{y_s, y_{s+1}} |P(y_s, y_{s+1}|y_t, y_{t+1}, x) - P(y_s, y_{s+1}|x)| \\
&= \frac{1}{2} \sum_{y_s, y_{s+1}} P(y_{s+1}|y_s, x) |P(y_s|y_{t+1}, x) - P(y_s|x)| \\
&= \frac{1}{2} \sum_{y_s} |P(y_s|y_{t+1}, x) - P(y_s|x)| \\
&= \|P(y_s|y_{t+1}, x) - P(y_s|x)\|_{tv}
\end{aligned}
\qquad (17)
$$

Now the case becomes the same as for node potentials, and we can use Corollary 3.1 to bound the total
variation. Specifically, let $q \in Q_{t,t+1}$ (i.e. $q(1) \in \{t, t+1\}$, $q(i) \notin \{t, t+1\}$ for $i > 1$). Then

$$
\begin{aligned}
&\sum_{i=1}^{len(q)} \|P(y_{q(i)}, y_{q(i+1)}|y_t = k_t, y_{t+1} = k_{t+1}, x) - P(y_{q(i)}, y_{q(i+1)}|x)\|_{tv} \\
&\quad = \sum_{i=1}^{len(q)} \|P(y_{q(i)}|y_{q(1)} = k_{q(1)}, x) - P(y_{q(i)}|x)\|_{tv} \\
&\quad \le \|P(y_{q(1)}|y_{q(1)} = k_{q(1)}, x) - P(y_{q(1)}|x)\|_{tv}
 + \sum_{i=2}^{len(q)} \|P(y_{q(1)}|y_{q(1)} = k_{q(1)}, x) - P(y_{q(1)}|x)\|_{tv} \prod_{j=1}^{i-1} \alpha_{q(j), q(j+1)} \\
&\quad = [1 - P(y_{q(1)} = k_{q(1)}|x)] \Big( 1 + \sum_{i=2}^{len(q)} \prod_{j=1}^{i-1} \alpha_{q(j), q(j+1)} \Big) \\
&\quad \le [1 - P(y_t = k_t, y_{t+1} = k_{t+1}|x)] \Big( 1 + \sum_{i=2}^{len(q)} \prod_{j=1}^{i-1} \alpha_{q(j), q(j+1)} \Big)
\end{aligned}
$$

Here the first inequality is due to Corollary 3.1, and the second holds because
$P(y_{q(1)} = k_{q(1)}|x) \ge P(y_t = k_t, y_{t+1} = k_{t+1}|x)$. Summing the results over all $q \in Q_{t,t+1}$ gives
Eq. (14). □
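
The reduction in Eq. (17) can be verified by brute force on a small chain. The sketch below assumes a 4-node chain with $t = 0$, $t + 1 = 1$, $s = 2$ and $s + 1 = 3$ (0-based), random positive toy potentials, and total variation defined as half the L1 distance. All function names are hypothetical.

```python
import itertools
import random


def chain_joint(n_nodes, n_states, seed=0):
    """Joint distribution of a chain with random positive potentials (toy data)."""
    rng = random.Random(seed)
    node = [[rng.uniform(0.5, 2.0) for _ in range(n_states)] for _ in range(n_nodes)]
    edge = [[[rng.uniform(0.5, 2.0) for _ in range(n_states)]
             for _ in range(n_states)] for _ in range(n_nodes - 1)]
    weights = {}
    for y in itertools.product(range(n_states), repeat=n_nodes):
        w = 1.0
        for s in range(n_nodes):
            w *= node[s][y[s]]
        for s in range(n_nodes - 1):
            w *= edge[s][y[s]][y[s + 1]]
        weights[y] = w
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}


def marginal(joint, nodes, fixed=()):
    """Distribution of y[nodes], conditioned on the (node, state) pairs in `fixed`."""
    table = {}
    for y, p in joint.items():
        if all(y[s] == k for s, k in fixed):
            key = tuple(y[s] for s in nodes)
            table[key] = table.get(key, 0.0) + p
    z = sum(table.values())
    return {key: p / z for key, p in table.items()}


def tv(p, q):
    """Total variation distance: half the L1 distance between two tables."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in keys)


N_STATES = 3
joint = chain_joint(4, N_STATES)
for k_t in range(N_STATES):
    for k_t1 in range(N_STATES):
        # Left side of Eq. (17): edge (y_2, y_3) given the edge (y_0, y_1).
        edge_tv = tv(marginal(joint, (2, 3), ((0, k_t), (1, k_t1))),
                     marginal(joint, (2, 3)))
        # Right side of Eq. (17): node y_2 given only the closer node y_1.
        node_tv = tv(marginal(joint, (2,), ((1, k_t1),)),
                     marginal(joint, (2,)))
        assert abs(edge_tv - node_tv) < 1e-12
```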